By the end of this workshop you should be able to:
That’s a lot of learning goals for this amount of time, so we better get started.
Reproducible research is the idea that another researcher should be able to take your code and your data and reproduce your analytic results! (Different from, but related to, the also-important concept of replicability.)
What do you need to reproduce an analysis?
A lot of the time, people want you to start with 7000 things all at the same time and make your research automatically reproducible and use a bunch of different software tools. Reproducible research can often feel like an unattainable cliff that makes you think “ok, well I can’t do all that, so why even try?”
IMO this reproducible research stuff is often incredibly overwhelming. People seem to want you to transfer from SPSS to makefiles and automated code instantly. But I don’t want you to do that.
Instead of talking about “best” reproducible research practices, let’s talk about “good enough” reproducible research practices. You know, stuff that will be good enough to get you through any federal funding mandates for reproducibility that come out.
If you want to know more about the motivations for reproducible research and open science in general, you can see my powerpoint for that on my website. I also have a bit of a longer walkthrough of some RR concepts you can see here, but that really needs to be cleaned up. I also have a page that I wrote for my adviser Andreas’ online course, which you can find here. He edited that one, so it’s maybe better.
For an example of a reproducible research project, you can check out my collaborator Dr. Brian McKay’s paper, which you can find on Data Dryad. You can actually download all the data and code from there, and the last time I checked, I could still get all the code to run on my computer.
Let’s go ahead and get started with those “good enough statistical practices.”
How does R find your files?
To talk about that, we probably need to talk about what a file tree is.
R has a neat thing called the working directory that makes the whole tree easier to navigate.
For example, if I need you to go get something from my house, and you have no clue where that is, I have to tell you:

1. Get to Athens, GA.
2. Go to [Zane’s House Address].
3. Go inside and go upstairs.
4. Go into the first room.
5. Go to the bookshelf.
6. Look at the third shelf.
7. Grab the book I asked you for.
Of course, if you already know where my house is, I can skip the first few bullet points – this is the same thing that the working directory is doing. The working directory is an address in the file tree that says “OK R, when I tell you to look for a file, start here.”
getwd()
## [1] "D:/proj/reproducible-research-demo-2023"
# setwd()
Can you imagine any problems with this?
What if our mutual friend has the same book, and you want to get it from their house? You can’t follow the address to my house and get the book!!
Solutions:

* The old way was to use relative file paths, but this is really fragile.
* The new way: use an R Project and the here package.
* I recommend that everyone read this blog post.
here::here()
## [1] "D:/proj/reproducible-research-demo-2023"
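As a quick sketch of why this matters in practice: here::here() builds paths relative to the project root, so the same line works no matter where the script is run from. (The data/measles.csv file below is hypothetical.)

```r
# Build a path starting from the project root, not from
# whatever the current working directory happens to be.
path <- here::here("data", "measles.csv")

# The path is always anchored at the project root, so the same
# code works on a collaborator's machine with a different root.
# dat <- read.csv(path)
print(path)
```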
Unfortunately, R has a tendency to ask whether you want to save your workspace and history when you close the program. You should always say NO! Why?
This is discussed more in that blog post I already linked by Jenny Bryan. In short, you want each run of your script to be independent!
YOU SHOULD DO THIS ON EVERY COMPUTER YOU SET UP WITH R!!
# install.packages("usethis")
usethis::use_blank_slate()
In the same line of thinking, it is tempting to think that all those things in your R environment are REAL. But they are NOT REAL. They are TEMPORARY things created by your code. And things in your R history are worse than not real, they are GHOSTS (which are not real and also scary).
Example project: https://github.com/wzbillings/Symptom-Agreement-Public
`::` functions, box, and renv

Often when people program in R, they type stuff like
library(MASS)
boxcox(mpg ~ wt, data = mtcars)
This is the easiest, and most confusing, way to deal with packages in R. If you follow the (appropriate) recommendation that all your library() invocations should go at the start of the script (why is this a good recommendation?), your code might have a lot of distance between the library() invocation and the invocation of the function from that package.
So it’s always better (for your future self and for others) to explicitly write it like this.
MASS::boxcox(...)
This solves multiple problems! For example, if you like tidyverse, but you do
library(dplyr)
library(MASS)
dat <- select(mtcars, mpg, wt)
You will be surprised to see that your code doesn’t work! MASS also exports a function called select(), and because MASS was loaded last, it masks dplyr’s version. The code would work if you instead said dplyr::select(...). So I recommend that you always, ALWAYS use the :: when you call a function from a package.
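With the :: prefix, the same snippet works regardless of the order the packages were loaded in:

```r
library(dplyr)
library(MASS)

# MASS::select() masks dplyr::select() here, but the :: prefix
# removes all ambiguity about which function we mean.
dat <- dplyr::select(mtcars, mpg, wt)
head(dat)
```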
box

A lot of other languages force you to do this by default. For example, in Python if you wrote
import statsmodels.api as sm
Logit(y, x).fit()
you would get an error. You have to write
import statsmodels.api as sm
sm.Logit(y, x).fit()
instead. If you want to get away without the sm., you
have to do either
from statsmodels.api import Logit
Logit(y, x).fit()
or
from statsmodels.api import *
Logit(y, x).fit()
which imports all functions from statsmodels.api into
your environment. Note that this is what R’s library()
function does by default, and this is generally not recommended in
Python. R has some built-in functionality like this; for example, you can
do

library(dplyr, include.only = "%>%")

which lets you use the pipe operator in the normal way. A better way to
list our dependencies in R and avoid filling up the global
environment with stuff we won’t use is the box package.
box::use(dplyr)
We can look at their README for more info.
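For example, box lets you attach only the specific functions you want, instead of everything the package exports (a sketch; this assumes the box package is installed):

```r
# Import only two dplyr verbs into the current scope;
# the rest of dplyr stays out of the search path.
box::use(dplyr[select, filter])

dat <- select(mtcars, mpg, wt)
```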
renv

We still have the issue of installing packages in the first place
though. If you send me your nicely formatted R project, I
still don’t have the packages you used, even if you put
box::use(my_cool_pkg) at the top of your script.
Suppose that you have my_cool_pkg v1.4 installed on your
computer. It works great, and you haven’t even thought about updating it
cause you never had any issues with it. However, you send me your code,
and I run install.packages("my_cool_pkg") and it gives me
v2.3. Somewhere along the line, the authors changed the function
my_cool_pkg::fancy_regression() so that it does regularized
regression in v2.3, but in v1.4 it still does
plain unregularized regression. Then I’ll get different results from
you! And it will probably be hard to find out why.
That’s where renv comes in. renv is
fantastic, but it can be a bit tricky – fortunately, every version so far
has gotten a bit easier to use. I recommend everyone read the main page
of their documentation, but we can walk through an example as well.
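The basic workflow boils down to a few function calls (shown as comments here, since they modify your project’s library):

```r
# renv::init()      # set up a project-specific library and an renv.lock file
# ...install and use packages as normal...
# renv::snapshot()  # record the exact package versions in renv.lock
# renv::restore()   # on another machine: reinstall those exact versions
```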
Even if you don’t follow anything else I say, this is the most important part! You can do all these reproducible research tricks, but they are still not very helpful if no one can understand what your code does.
The main downside here is that code can often get very complicated. Here’s an example. First I’ll show the easy-to-read, good version.
lm(mpg ~ wt, data = mtcars)
Now imagine if instead of that I did this.
my_function <- function(formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if (method == "model.frame")
return(mf)
else if (method != "qr")
warning(gettextf("method = '%s' is not supported. Using 'qr'",
method), domain = NA)
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
w <- as.vector(model.weights(mf))
if (!is.null(w) && !is.numeric(w))
stop("'weights' must be a numeric vector")
offset <- model.offset(mf)
mlm <- is.matrix(y)
ny <- if (mlm)
nrow(y)
else length(y)
if (!is.null(offset)) {
if (!mlm)
offset <- as.vector(offset)
if (NROW(offset) != ny)
stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
NROW(offset), ny), domain = NA)
}
if (is.empty.model(mt)) {
x <- NULL
z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
0) else ny)
if (!is.null(offset)) {
z$fitted.values <- offset
z$residuals <- y - offset
}
}
else {
x <- model.matrix(mt, mf, contrasts)
z <- if (is.null(w))
lm.fit(x, y, offset = offset, singular.ok = singular.ok,
...)
else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
...)
}
class(z) <- c(if (mlm) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
if (model)
z$model <- mf
if (ret.x)
z$x <- x
if (ret.y)
z$y <- y
if (!qr)
z$qr <- NULL
z
}
my_function(mpg ~ wt, data = mtcars)
These will give you the exact same output, but it’s pretty clear why you would want to use one instead of the other, right? This is demonstrating the importance of scripts and functions!
There are other options for literate programming, but right now the standout platform is Quarto. Quarto is newer, and thus somewhat more buggy and less feature-complete, than R Markdown, and you have to install another program. But I still highly recommend Quarto. R Markdown, Jupyter, and IJulia/Pluto.jl are all pretty good and also free.
A good design pattern is to organize your code into functions, your
functions into scripts, and then source() those scripts
into your literate programming documents.
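A minimal sketch of this pattern, assuming a hypothetical helper script at R/functions.R:

```r
# In R/functions.R: small, well-named functions that do one thing each.
fit_mpg_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

# In your Quarto / R Markdown document, load the helpers and call them:
# source(here::here("R", "functions.R"))
model <- fit_mpg_model(mtcars)
summary(model)
```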
Of course I think it’s easier to learn by example instead of by me talking about it in these notes.
The .csl file controls the appearance of your bibliography. You can find one for pretty much any journal here.